Druid (open-source Data Store)

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data (Hemsoth, Nicole, Datanami, 8 November 2012). The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems. Druid is commonly used in business intelligence/OLAP applications to analyze high volumes of real-time and historical data. Druid is used in production by technology companies such as Alibaba, Airbnb, Cisco, eBay, Lyft, Netflix, PayPal, Pinterest, Reddit, Twitter, Walmart, Wikimedia Foundation and Yahoo.


History

Druid was started in 2011 by Eric Tschetter, Fangjin Yang, Gian Merlino and Vadim Ogievetsky to power the analytics product of Metamarkets. The project was open-sourced under the GPL license in October 2012 (Tschetter, Eric, druid.apache.org, 24 October 2012; Higginbotham, Stacey, GigaOM, 24 October 2012) and moved to an Apache License in February 2015.


Architecture

Fully deployed, Druid runs as a cluster of specialized processes (called nodes in Druid) to support a fault-tolerant architecture where data is stored redundantly and there is no single point of failure. The cluster includes external dependencies for coordination (Apache ZooKeeper), metadata storage (e.g. MySQL, PostgreSQL, or Derby), and a deep storage facility (e.g. HDFS or Amazon S3) for permanent data backup.


Query management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
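
The routing and merging described above can be illustrated with a small scatter-gather sketch. This is a toy model, not Druid's implementation or API: the segment names, node names, row data, and the functions `query_node` and `broker_query` are all hypothetical, and the "query" is assumed to be a simple sum of counts per country.

```python
from collections import defaultdict

# Hypothetical cluster state: which data node serves which segment, and the
# rows each segment holds. All names here are illustrative, not Druid APIs.
SEGMENT_LOCATIONS = {
    "events_2012-10-24_p0": "historical-1",
    "events_2012-10-24_p1": "historical-2",
    "events_2012-10-25_p0": "realtime-1",
}

NODE_DATA = {
    "historical-1": {"events_2012-10-24_p0": [("US", 3), ("DE", 1)]},
    "historical-2": {"events_2012-10-24_p1": [("US", 2), ("FR", 4)]},
    "realtime-1": {"events_2012-10-25_p0": [("DE", 5)]},
}


def query_node(node, segment_ids):
    """Simulate a data node computing a partial sum per country."""
    partial = defaultdict(int)
    for segment_id in segment_ids:
        for country, count in NODE_DATA[node][segment_id]:
            partial[country] += count
    return dict(partial)


def broker_query(segment_ids):
    """Route segments to the nodes that hold them, then merge partial results."""
    # Scatter: group the requested segments by the node that serves them.
    by_node = defaultdict(list)
    for segment_id in segment_ids:
        by_node[SEGMENT_LOCATIONS[segment_id]].append(segment_id)

    # Gather: merge the per-node partial sums into one aggregated result.
    merged = defaultdict(int)
    for node, ids in by_node.items():
        for country, count in query_node(node, ids).items():
            merged[country] += count
    return dict(merged)


result = broker_query(list(SEGMENT_LOCATIONS))
print(result)
```

In this sketch each node only sees the segments it holds, and the broker is the only place where results from historical and real-time nodes are combined.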


Cluster management

Operations relating to data management in historical nodes are overseen by coordinator nodes. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
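
As a rough illustration of what overseeing data management with redundancy can involve, the sketch below places each segment on several distinct historical nodes, preferring the least-loaded ones. It is an assumption-laden toy: the function `assign_replicas`, the replica count, and the least-loaded rule are hypothetical and are not taken from Druid's actual coordinator logic.

```python
import heapq


def assign_replicas(segments, nodes, replicas=2):
    """Place each segment on `replicas` distinct nodes, least-loaded first."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    load = {node: 0 for node in nodes}
    placement = {}
    for segment in segments:
        # Pick the nodes currently holding the fewest segments; ties break by name.
        chosen = heapq.nsmallest(replicas, nodes, key=lambda n: (load[n], n))
        for node in chosen:
            load[node] += 1
        placement[segment] = chosen
    return placement


# Hypothetical segment and node names, for illustration only.
placement = assign_replicas(["seg-a", "seg-b", "seg-c"], ["h1", "h2", "h3"])
for segment, holders in placement.items():
    print(segment, "->", holders)
```

With two replicas per segment, losing any single node in this toy cluster still leaves every segment available on another node.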


Features

* Low latency (streaming) data ingestion
* Arbitrary slice and dice data exploration
* Sub-second analytic queries
* Approximate and exact computations


Performance

Researchers have compared the performance of Hive, Presto, and Druid using a denormalized Star Schema Benchmark based on the TPC-H standard. Druid was tested using both a “Druid Best” configuration using tables with hashed partitions and a “Druid Suboptimal” configuration which does not use hashed partitions. Tests were conducted by running the 13 TPC-H queries using TPC-H Scale Factor 30 (a 30 GB database), Scale Factor 100 (a 100 GB database), and Scale Factor 300 (a 300 GB database). Druid performance was measured as at least 98% faster than Hive and at least 90% faster than Presto in each scenario, even when using the “Druid Suboptimal” configuration.


See also

* List of column-oriented DBMSes

